Univariate TS Models (ARIMA/SARIMA)

Published

October 2, 2025

Code
library(tidyverse)  # includes ggplot2 and dplyr
library(forecast)
library(astsa)
library(xts)
library(tseries)
library(fpp2)
library(fma)
library(lubridate)
library(TSstudio)
library(quantmod)
library(tidyquant)
library(plotly)
library(seasonal)
library(patchwork)
Code
la_teu_data <- read_csv("data-raw/la_teu_data.csv", show_col_types = FALSE)
head(la_teu_data)
# A tibble: 6 × 9
  Date   Loaded_Imports Empty_Imports Total_Imports Loaded_Exports Empty_Exports
  <chr>           <dbl>         <dbl>         <dbl>          <dbl>         <dbl>
1 Jan-24        441763.          179        441942.        126554.       287156.
2 Feb-24        408764.          139        408903.        132754.       239776.
3 Mar-24        379542.         1018.       380560         144718.       218139.
4 Apr-24        416929.          253        417182.        133046.       220109 
5 May-24        390663.          468.       391131.        125963.       235800 
6 Jun-24        428753.          545        429298.        122515        275944.
# ℹ 3 more variables: Total_Exports <dbl>, Total_TEUs <chr>,
#   Prior_Year_Change <chr>

For Total TEUs

Code
df <- la_teu_data %>%
  transmute(
    Date = coalesce(suppressWarnings(my(Date)),    # "Jan-24" style dates
                    suppressWarnings(ymd(Date))),  # "2024-01-01" style dates
    Total_TEUs = parse_number(Total_TEUs)          # strip commas from totals
  ) %>%
  arrange(Date) %>%
  filter(!is.na(Date), !is.na(Total_TEUs))

win_start <- ymd("2000-01-01")
win_end   <- ymd("2024-12-01")

df_teu <- df %>%
  filter(Date >= win_start, Date <= win_end) %>%
  mutate(
    TEUs = Total_TEUs,
    log_TEUs = log(Total_TEUs + 1)
  ) %>%
  select(Date, TEUs, log_TEUs)

head(df_teu)
# A tibble: 6 × 3
  Date          TEUs log_TEUs
  <date>       <dbl>    <dbl>
1 2000-01-01 358234.     12.8
2 2000-02-01 336925.     12.7
3 2000-03-01 357267.     12.8
4 2000-04-01 407032.     12.9
5 2000-05-01 408035.     12.9
6 2000-06-01 402557.     12.9
Code
###### Plot Original Data using Plotly ######
plot_ly(df_teu, x = ~as.Date(Date), y = ~TEUs, type = 'scatter', mode = 'lines',
        line = list(color = '#ffa988')) %>%
  layout(title = "Original Time Series of Port of Los Angeles Container Throughput (TEUs)",
         xaxis = list(title = "Date"),
         yaxis = list(title = "TEUs"))

The time series of container throughput (TEUs) at the Port of Los Angeles from 2000 to 2024 shows an overall upward trend with pronounced cyclical fluctuations, reflecting both long-term trade growth and sensitivity to global shocks. Throughput expanded steadily in the early 2000s before declining sharply during the 2008 financial crisis, after which it recovered and stabilized within a broad seasonal cycle. Beginning in 2020, the series exhibits extreme volatility: a sharp contraction during the onset of the COVID-19 pandemic was followed by record peaks in 2021 as consumer demand surged and supply chain congestion intensified. In recent years, volumes have moderated but remain elevated, underscoring both the vulnerability of the port to external disruptions and its resilience as a central hub in U.S.–China trade and global supply chains.

Code
ts_df <- ts(df_teu$TEUs, 
            start = decimal_date(min(df_teu$Date)), 
            frequency = 12)  # monthly data
ts_log <- ts(df_teu$log_TEUs, 
             start = decimal_date(min(df_teu$Date)), 
             frequency = 12)  # monthly data
plot(ts_df, main = "Time Series plot for original data")

Code
plot(ts_log, main = "Time Series plot for log data")

The log transformation does not yield a significant improvement in residual behavior, while complicating forecast back-transformation. Therefore, we retain the original scale of the data for analysis and forecasting.
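To make the back-transformation point concrete: exponentiating log-scale point forecasts returns the forecast median, not the mean, so a lognormal bias correction is needed. A minimal sketch, assuming the log(x + 1) transform above (the SARIMA order here is illustrative, not the model selected later):

```r
library(forecast)

# Illustrative fit on the log series; exp() of a log-scale forecast is the
# median, and adding sigma^2 / 2 before exponentiating approximates the mean.
fit_log <- Arima(ts_log, order = c(1, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))
fc_log  <- forecast(fit_log, h = 12)

fc_median <- exp(fc_log$mean) - 1                        # undo log(x + 1)
fc_mean   <- exp(fc_log$mean + fit_log$sigma2 / 2) - 1   # bias-corrected mean
```

The forecast package can also handle this automatically via `Arima(..., lambda = 0, biasadj = TRUE)`, at the cost of modeling log(x) rather than log(x + 1).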

ACF with original data

Code
# ACF Plot
ggAcf(ts_df) +
  ggtitle("ACF Plot of Port of Los Angeles Container Throughput") +
  theme_minimal()

ADF test

Code
################ ADF Test #############
tseries::adf.test(ts_log)

    Augmented Dickey-Fuller Test

data:  ts_log
Dickey-Fuller = -6.0505, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary

The ACF plot of the original series shows a slow decay in autocorrelations across multiple lags, indicating strong persistence. The ADF test, by contrast, yields a p-value of 0.01, below the 0.05 significance threshold, and so rejects the null hypothesis of a unit root. Because the ADF test can reject even in the presence of a deterministic trend and strong seasonality, we give more weight to the visual evidence: the slowly decaying ACF and the pronounced trend and seasonal pattern indicate that the raw TEUs series should still be treated as non-stationary, so we proceed with differencing.
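As a cross-check, the forecast package provides helpers that estimate the differencing orders directly; a sketch (the suggested orders depend on the data, so no output is assumed here):

```r
library(forecast)

# ndiffs() applies a KPSS-type test by default to suggest the number of
# regular differences; nsdiffs() uses a seasonal-strength measure, with the
# period taken from the ts frequency (12 here).
ndiffs(ts_df)    # suggested d
nsdiffs(ts_df)   # suggested D
```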

Code
diff_1 <- diff(ts_df)
diff_2 <- diff(ts_df,differences = 2)

# ACF plots of the first- and second-differenced data
acf_plot1 <- ggAcf(diff_1, 40) +
  labs(title = "ACF of first differenced TEUs") 
acf_plot2 <- ggAcf(diff_2, 40) +
  labs(title = "ACF of second differenced TEUs") 

# Combine ACF Plots using patchwork
acf_plot1 / acf_plot2

The ACF plot of the first-differenced series shows that autocorrelations drop sharply after lag 1 and remain largely within the confidence bounds, indicating that first differencing has effectively removed the nonstationarity of the original TEUs series. In contrast, the second-differenced series does not provide additional improvement and may risk overdifferencing, which can distort the underlying structure of the data.

Therefore, a first-order difference is sufficient.
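A quick numeric check for overdifferencing, as a heuristic rather than a formal test: the sample variance should fall after the first difference and rise again if an unnecessary second difference is taken.

```r
# Variance of the level series and its first and second differences;
# an increase from d1 to d2 is a classic symptom of overdifferencing.
sapply(list(level = ts_df, d1 = diff_1, d2 = diff_2), var)
```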

Code
diff_s <- diff(diff_1, lag = 12)  # additional seasonal difference

# ACF/PACF plots for the regular and seasonal differenced data
acf_plot1 <- ggAcf(diff_1, 40) +
  labs(title = "ACF of first differenced TEUs") 
pacf_plot1 <- ggPacf(diff_1, 40) +
  labs(title = "PACF of first differenced TEUs")  

acf_plots <- ggAcf(diff_s, 40) +
  labs(title = "ACF of first differenced seasonal TEUs") 
pacf_plots <- ggPacf(diff_s, 40) +
  labs(title = "PACF of first differenced seasonal TEUs") 

# Combine ACF Plots using patchwork
acf_plot1 / pacf_plot1

Code
acf_plots / pacf_plots

Candidate grid for the manual search: p = 1:4, d = 1, q = 0:1, P = 0:1, D = 1, Q = 1:3

Code
SARIMA.c <- function(p1, p2, q1, q2, P1, P2, Q1, Q2, data) {
  d <- 1; D <- 1
  if (!is.ts(data)) data <- ts(data, frequency = 12)
  s <- frequency(data); if (is.null(s) || is.na(s) || s <= 1) s <- 12
  
  rows <- list()
  
  i <- 1
  for (p in p1:p2) {
    for (q in q1:q2) {
      for (P in P1:P2) {
        for (Q in Q1:Q2) {
          
          if (p + d + q + P + D + Q <= 9) {
            model <- tryCatch(
              Arima(
                data,
                order    = c(p, d, q),
                seasonal = list(order = c(P, D, Q), period = s)
              ),
              error = function(e) NULL
            )
            
            if (!is.null(model)) {
              rows[[i]] <- c(p, d, q, P, D, Q, model$aic, model$bic, model$aicc)
              i <- i + 1
            }
          }
        }
      }
    }
  }
  
  if (length(rows) == 0) {
    out <- setNames(
      as.data.frame(matrix(numeric(0), nrow = 0, ncol = 9)),
      c("p","d","q","P","D","Q","AIC","BIC","AICc")
    )
    return(out)
  }
  
  out <- as.data.frame(do.call(rbind, rows))
  names(out) <- c("p","d","q","P","D","Q","AIC","BIC","AICc")
  for (nm in names(out)) out[[nm]] <- as.numeric(out[[nm]])
  rownames(out) <- NULL
  
  out
}
Code
output <- SARIMA.c(
  p1=1, p2=4,
  q1=0, q2=1,
  P1=0, P2=1,
  Q1=1, Q2=3,
  data = ts_df
)

output[which.min(output$AIC), , drop=FALSE]
   p d q P D Q      AIC      BIC     AICc
41 4 1 1 0 1 1 7352.075 7377.692 7352.477
Code
output[which.min(output$BIC), , drop=FALSE]
  p d q P D Q      AIC      BIC     AICc
7 1 1 1 0 1 1 7355.846 7370.484 7355.987
Code
output[which.min(output$AICc), , drop=FALSE]
   p d q P D Q      AIC      BIC     AICc
41 4 1 1 0 1 1 7352.075 7377.692 7352.477

The manual search selects SARIMA(4,1,1)(0,1,1)[12] by AIC and AICc, and SARIMA(1,1,1)(0,1,1)[12] by BIC.

Code
auto.arima(ts_df)
Series: ts_df 
ARIMA(0,1,4)(0,0,2)[12] with drift 

Coefficients:
          ma1     ma2      ma3      ma4    sma1    sma2      drift
      -0.6610  0.0216  -0.0746  -0.1645  0.1430  0.1425  1547.8423
s.e.   0.0591  0.0681   0.0742   0.0603  0.0578  0.0595   814.8587

sigma^2 = 7.576e+09:  log likelihood = -3822.47
AIC=7660.95   AICc=7661.45   BIC=7690.55

SARIMA(0,1,4)(0,0,2)[12] with drift is the best model from the automatic search.

Code
# Set seed for reproducibility
set.seed(345)

###### Fit SARIMA(4,1,1)(0,1,1)[12] Model ######
model_1_output <- capture.output(sarima(ts_df, 4, 1, 1, 0, 1, 1, 12))

Code
model_2_output <- capture.output(sarima(ts_df, 1, 1, 1, 0, 1, 1, 12))

Code
model_3_output <- capture.output(sarima(ts_df, 0, 1, 4, 0, 0, 2, 12))

Code
###### Extract and Print Relevant Output ######
extract_model_output <- function(model_output, model_name) {
  start_line <- grep("Coefficients", model_output)  # Locate coefficient section
  end_line <- length(model_output)  # Capture entire output
  cat("\n###", model_name, "###\n")
  cat(model_output[start_line:end_line], sep = "\n")  # Print relevant details
}

###### Print Model Outputs ######
extract_model_output(model_1_output, "SARIMA(4,1,1)(0,1,1)[12]")

### SARIMA(4,1,1)(0,1,1)[12] ###
Coefficients: 
     Estimate     SE  t.value p.value
ar1   -0.0330 0.1167  -0.2826  0.7777
ar2    0.0333 0.0915   0.3636  0.7164
ar3   -0.0312 0.0735  -0.4245  0.6715
ar4   -0.1885 0.0655  -2.8768  0.0043
ma1   -0.6737 0.1097  -6.1412  0.0000
sma1  -0.9438 0.0576 -16.3943  0.0000

sigma^2 estimated as 6778107200 on 281 degrees of freedom 
 
AIC = 25.61699  AICc = 25.61803  BIC = 25.70624 
 
Code
extract_model_output(model_2_output, "SARIMA(1,1,1)(0,1,1)[12]")

### SARIMA(1,1,1)(0,1,1)[12] ###
Coefficients: 
     Estimate     SE  t.value p.value
ar1    0.0417 0.0864   0.4820  0.6302
ma1   -0.7492 0.0622 -12.0427  0.0000
sma1  -0.9615 0.0784 -12.2688  0.0000

sigma^2 estimated as 6935294654 on 284 degrees of freedom 
 
AIC = 25.63012  AICc = 25.63042  BIC = 25.68113 
 
Code
extract_model_output(model_3_output, "SARIMA(0,1,4)(0,0,2)[12]")

### SARIMA(0,1,4)(0,0,2)[12] ###
Coefficients: 
          Estimate       SE  t.value p.value
ma1        -0.6610   0.0591 -11.1900  0.0000
ma2         0.0216   0.0681   0.3178  0.7509
ma3        -0.0746   0.0742  -1.0060  0.3153
ma4        -0.1645   0.0603  -2.7293  0.0067
sma1        0.1430   0.0578   2.4739  0.0139
sma2        0.1425   0.0595   2.3952  0.0172
constant 1547.8423 814.8587   1.8995  0.0585

sigma^2 estimated as 7398598646 on 292 degrees of freedom 
 
AIC = 25.6219  AICc = 25.62319  BIC = 25.72091 
 

Among the three candidate models, SARIMA(1,1,1)(0,1,1)[12] is selected as the most appropriate.

Although the (4,1,1)(0,1,1)[12] specification yields a slightly lower AIC, several of its AR coefficients are statistically insignificant, suggesting potential overfitting and reduced interpretability. Similarly, the (0,1,4)(0,0,2)[12] model captures short-term dynamics but fails to incorporate seasonal differencing (D=1), which is inconsistent with the pronounced seasonality of the TEUs series. In contrast, the (1,1,1)(0,1,1)[12] model achieves a favorable balance, with competitive AIC/BIC values, statistically significant parameters, and a parsimonious structure that adequately reflects both the underlying trend and seasonal dependence. Therefore, it represents the most robust and reliable choice among the three models.
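Before relying on the selected model, its residuals should be checked for remaining autocorrelation; roughly white-noise residuals support the specification. A sketch using standard diagnostics (fitdf is set to 3, the number of estimated coefficients ar1, ma1, and sma1):

```r
library(forecast)

# Residual diagnostics for SARIMA(1,1,1)(0,1,1)[12]: the residual ACF should
# show no significant spikes, and the Ljung-Box test should fail to reject
# the white-noise null.
fit_111 <- Arima(ts_df, order = c(1, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))
checkresiduals(fit_111)   # residual plot, ACF, histogram, Ljung-Box test
Box.test(residuals(fit_111), lag = 24, type = "Ljung-Box", fitdf = 3)
```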

Code
###### Fit SARIMA(1,1,1)(0,1,1)[12] Model ######
fit <- Arima(ts_df, order = c(1, 1, 1), seasonal =list( order=c(0, 1, 1), period=12))

###### Display Model Summary ######
summary(fit)
Series: ts_df 
ARIMA(1,1,1)(0,1,1)[12] 

Coefficients:
         ar1      ma1     sma1
      0.0417  -0.7492  -0.9615
s.e.  0.0864   0.0622   0.0784

sigma^2 = 7.009e+09:  log likelihood = -3673.92
AIC=7355.85   AICc=7355.99   BIC=7370.48

Training set error measures:
                    ME    RMSE      MAE       MPE     MAPE      MASE
Training set -2360.452 81454.1 47476.86 -315.7311 321.5409 0.5693415
                     ACF1
Training set -0.005805141
Code
# Forecast the next 36 periods (3 years)
forecast_result <- forecast(fit, h = 36)

# Display forecast accuracy
accuracy(forecast_result)
                    ME    RMSE      MAE       MPE     MAPE      MASE
Training set -2360.452 81454.1 47476.86 -315.7311 321.5409 0.5693415
                     ACF1
Training set -0.005805141
Code
# Plot the forecast
autoplot(forecast_result) +
  labs(title = "ARIMA(1,1,1)x(0,1,1)[12] Forecast",
       x = "Time",
       y = "Predicted Values") +
  theme_minimal()

This plot presents the SARIMA(1,1,1)(0,1,1)[12] forecast for the next three years. The model suggests that container throughput at the Port of Los Angeles will remain at historically high levels, with clear seasonal fluctuations persisting. The mean forecast indicates relative stability without a strong upward or downward trend, while the widening prediction intervals highlight increasing uncertainty over time, since port activity is highly sensitive to global trade conditions, capacity constraints, and potential external shocks.

Code
###### Forecast for 36 Periods (3 Years) ######
forecast_horizon <- 36  # 36 months (3 years)

###### Plot Forecasts Using Mean, Naïve, Drift Methods, and ARIMA Fit ######
autoplot(ts_df) +
  autolayer(meanf(ts_df, h = forecast_horizon), series = "Mean", PI = FALSE) +
  autolayer(snaive(ts_df, h = forecast_horizon), series = "Seasonal Naïve", PI = FALSE) +
  autolayer(rwf(ts_df, drift = TRUE, h = forecast_horizon), series = "Drift", PI = FALSE) +
  autolayer(forecast(fit, h = forecast_horizon), series = "ARIMA Fit", PI = FALSE) +
  ggtitle("TEUs Forecast (Next 3 Years)") +
  xlab("Year") + 
  ylab("TEUs") +
  guides(colour = guide_legend(title = "Forecast Methods")) +
  theme_minimal()

This plot compares SARIMA with alternative benchmark methods: mean, seasonal naïve, and drift forecasts. The seasonal naïve model repeats the most recent seasonal cycle, so it reproduces the seasonal shape but ignores trend and recent dynamics. The mean method produces an overly smoothed, nearly constant forecast that lacks interpretive value for a dynamic system. The drift model incorporates a trend component but neglects the strong seasonal patterns present in the data. In contrast, the SARIMA model captures both the long-term dynamics and the recurring seasonal cycles, producing forecasts that are more consistent with historical patterns.

As such, SARIMA provides a more realistic and reliable basis for understanding future container throughput compared to simpler benchmark methods.
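The visual comparison can be quantified with a simple holdout: refit each method on all but the last 24 months and compare test-set RMSE (a sketch; the 24-month split is an arbitrary choice):

```r
library(forecast)

# Hold out the last 24 observations and score each method out of sample.
h_test <- 24
train  <- subset(ts_df, end = length(ts_df) - h_test)
test   <- subset(ts_df, start = length(ts_df) - h_test + 1)

fc_list <- list(
  Mean   = meanf(train, h = h_test),
  SNaive = snaive(train, h = h_test),
  Drift  = rwf(train, drift = TRUE, h = h_test),
  SARIMA = forecast(Arima(train, order = c(1, 1, 1),
                          seasonal = list(order = c(0, 1, 1), period = 12)),
                    h = h_test)
)
sapply(fc_list, function(f) accuracy(f, test)["Test set", "RMSE"])
```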

Code
# Load required libraries
library(tsibble)
library(prophet)

prophet_data <- df_teu %>%
  select(Date, TEUs) %>%
  rename(ds = Date, y = TEUs)  # Prophet expects columns ds and y

###### Fit the Prophet Model ######
prophet_model <- prophet(prophet_data)

###### Extend the Forecast Horizon by 36 Months (3 Years) ######
future_dates <- make_future_dataframe(prophet_model, periods = 36, freq = "month")

###### Generate Forecasts ######
forecast_data <- predict(prophet_model, future_dates)

###### Plot Forecast ######
plot(prophet_model, forecast_data) +
  ggtitle("Forecast TEUs (Next 3 Years)") +
  xlab("Date") +
  ylab("TEUs")

This Prophet forecast differs from the SARIMA approach in several key ways. SARIMA handles seasonality through seasonal differencing and seasonal ARMA terms, so its forecasts depend directly on recent observed values. Prophet instead fits an additive regression model, decomposing the series into a piecewise-linear trend and a yearly seasonal component represented by Fourier terms; because it has no autoregressive component, its forecasts follow the fitted curve rather than the most recent observations. Both approaches reproduce the seasonal swings, but Prophet's wider uncertainty bands reflect the additional uncertainty it attaches to future trend changes. SARIMA is particularly well-suited to series with strong, stable seasonal patterns, while Prophet is more flexible when the trend shifts, though it may be less responsive to recent dynamics over long horizons.
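Prophet's additive decomposition can be inspected directly, which makes the trend/seasonality comparison with SARIMA concrete (a sketch reusing prophet_model and forecast_data from above; the component column names are those returned by predict()):

```r
library(prophet)

# The predict() output contains the additive components of the fit.
head(forecast_data[, c("ds", "trend", "yearly", "yhat")])

# Plot the components: piecewise-linear trend plus yearly seasonality
# (modeled with Fourier terms), estimated as additive regressors.
prophet_plot_components(prophet_model, forecast_data)
```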

Code
###### Forecast Next 12 Months Using SARIMA(1,1,1)(0,1,1)[12] ######
sarima.for(ts_df, 12, 1,1,1,0,1,1,12)

$pred
          Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
2020 896182.2 804110.1 846735.9 883995.5 921785.0 909357.3 946982.0 977222.2
          Sep      Oct      Nov      Dec
2020 943340.7 951048.3 883549.4 899035.3

$se
           Jan       Feb       Mar       Apr       May       Jun       Jul
2020  83814.42  87323.12  90060.13  92691.96  95250.12  97741.31 100170.56
           Aug       Sep       Oct       Nov       Dec
2020 102542.27 104860.36 107128.30 109349.24 111526.52

The black line and points represent the historical container throughput (TEUs), while the red line and points indicate the 12-month forecast, which follows the historical seasonal pattern and maintains a relatively stable level without sharp deviations. The shaded gray area illustrates the prediction intervals, which widen over time, reflecting growing uncertainty at longer horizons.

Overall, the model suggests that throughput at the Port of Los Angeles is expected to remain close to its historical average over the next year, with regular seasonal fluctuations but no significant structural shifts. The widening forecast intervals also highlight the potential influence of external economic, trade, or policy factors that could affect future outcomes.